This example shows how to solve a common data prep problem - how to convert a wide table to a tidy narrow table. The wide data form is common in spread sheets - especially those representing time series.
The Bureau of Labor Statistics (the BLS) data page, has many data sets and many ways to access them. For a research project we needed the monthly employment figures by "metro areas." Using the Multi-screen Data Search tool for Employment, Hours, and Earnings - State and Metro Area we captured the non-farm employment from 1995 through 2015 as a tab seperated file. See the BLS sm file spec
There is a bit clean-up to do to make it readable with read_tsv()
from Hadley's readr package. See the function BLS_CleanRawTextFile()
included in this package if you are interested in the details (which are not relevant for this dplyr example).
employment
and take a glimpse of the first and last ten columnslibrary(dplyr) library(tidyr) library(readr) library(dplyrExamples) library(stringr) library(ggplot2) library(lubridate) fn <- system.file("extdata", "BLS_NonFarmEmploymentInAreas_1995_2015.tsv", package = "dplyrExamples") file_out <- "file_out.tsv" BLS_CleanRawTextFile(fn, file_out) ## Clean up the BLS raw text file employment <- read_tsv(file_out) dim(employment) glimpse(employment[1:10]) glimpse(employment[244:253])
There are two challenges with this data set:
The series_id decoder is in Section 5 of the BLS SM file spec and is repeated here:
0 1 2 <-- 12345678901234567890 <-- character counter SMU01266207072200001 <-- Sample Series ID Positions Code Value 1-2 survey abbreviation = SM 3 seasonal (code) = U 4-5 state_code = 01 6-10 area_code = 26620 11-12 supersector_code = 70 13-18 industry_code = 70722000 19-20 data_type_code = 01
We did not need NIAC's code breakdown so positions 11-18 are all zeros in our extract.
The BLS has standard lookup files for the state and area codes which we load now.
fac <- system.file("extdata", "BLS_AreaCodes.tsv", package = "dplyrExamples") area_codes <- read_tsv(fac) fsc <- system.file("extdata", "BLS_StateCodes.tsv", package = "dplyrExamples") state_codes <- read_tsv(fsc) glimpse(area_codes) glimpse(state_codes)
Now we have everything we need to tidy up our data.
This is the dplyr sequence you would use in production. The next section breaks down the process step-by-step.
Employment_By_Area_1995_2015 <- employment %>% gather(mmm_yyyy, NonFarm_000, -Series_ID) %>% mutate(Month_Of = as.Date(paste0("01_", mmm_yyyy), format = "%d_%b_%Y"), state_code = str_sub(Series_ID, 4, 5), area_code = str_sub(Series_ID, 6, 10)) %>% left_join(state_codes) %>% left_join(area_codes) %>% mutate_each(funs(factor), ends_with("name")) %>% rename(State = state_name, Area = area_name) %>% select(State, Area, Month_Of, NonFarm_000) %>% arrange(State, Area, Month_Of) glimpse(Employment_By_Area_1995_2015)
tidyr::gather()
.eba <- employment %>% gather(mmm_yyyy, NonFarm_000, -Series_ID) glimpse(eba)
mmm_yyyy
to Date and pull out state & area codes from Series_ID.eba <- eba %>% mutate(Month_Of = as.Date(paste0("01_", mmm_yyyy), format = "%d_%b_%Y"), state_code = str_sub(Series_ID, 4, 5), area_code = str_sub(Series_ID, 6, 10)) glimpse(eba)
eba <- eba %>% left_join(state_codes) %>% left_join(area_codes) glimpse(eba)
eba <- eba %>% mutate_each(funs(factor), ends_with("name")) %>% rename(State = state_name, Area = area_name) glimpse(eba)
eba <- eba %>% select(State, Area, Month_Of, NonFarm_000) %>% arrange(State, Area, Month_Of) glimpse(eba)
Here are a cople of examples...
summary(Employment_By_Area_1995_2015) # pull out New Mexico data AreasInNM<- Employment_By_Area_1995_2015 %>% filter(State == "New Mexico", Month_Of >= as.Date("2004-01-01")) %>% mutate(Year = factor(year(Month_Of)), Month = month(Month_Of)) %>% droplevels() # get month labels for plot months <- month(seq(as.Date("2000/1/1"), by = "month", length.out = 12), label = TRUE, abbr = TRUE) ggplot(AreasInNM, aes(Month, NonFarm_000, color = Year)) + geom_point(size = I(1)) + geom_line() + scale_x_continuous(breaks = 1:12, labels = months) + ggtitle("Employment (000) by Year for Areas in New Mexico") + facet_grid(Area ~ ., scales = "free_y" )
The place to start, of course, is Hadley's vignettes in the dplyr and tidy packages. Especially Introduction to dplyr and Tidy Data.
Now that Hadley is with RStudio, search their blog for dplyr and tidyr; get the Data Wrangling Cheat Sheet; watch Data Wrangling with R & RStudio. To understand Hadley's current thinking about data analysis watch Pipelines for Data Analysis in R and The Grammar and Graphics of Data Science - the latter with Winston Chang.
Lastly, see Garrett & Hadley's chapter on data transform in their upcoming R for Data Science
We hope you have found this example of using dplyr and tidyr useful. Please send comments and suggestions to Jim at DS4CI.org or leave an issue or pull request at my github.
Thanks! Jim
file.remove(file_out)
END
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.